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Abstract 

The linkage era left a rich legacy of pedigree samples that can be used for modern genome-wide association 
sequencing (GWAS) or next-generation sequencing (NGS) studies. Family designs are naturally equipped to detect 
rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Unfortunately, 
pedigree likelihoods are notoriously hard to compute, and current software for association mapping in pedigrees is 
prohibitively slow in processing dense marker maps. In a recent release of the comprehensive genetic analysis 
software MENDEL, we implemented an ultra-fast score test for association mapping with pedigree-based GWAS or 
NGS study data. Our implementation (a) works for random sample data, pedigree data, or a mix of both;(b) allows 
for covariate adjustment, including correction for population stratification;(c) accommodates both univariate and 
multivariate quantitative traits; and (d) allows missing values in multivariate traits. In this paper, we assess the 
capabilities of MENDEL on the Genetic Analysis Workshop 18 sequencing data. For instance, when jointly testing 
the 4 longitudinally measured diastolic blood pressure traits, it takes MENDEL less than 51 minutes on a standard 
laptop computer to read, quality check, and analyze a data set with 959 individuals and 8.3 million single- 
nucleotide polymorphisms (SNPs). Our analysis reveals association of one SNP in the q32.2 region of chromosome 
1. MENDEL is freely available on http://www.genetics.ucla.edu/software. 



Background 

Pedigree data are attractive in modern association studies 
because they permit control of population substructure 
and study of parent-of-origin effects [1]. Related affecteds 
are also more likely to share the same disease-predisposing 
gene than unrelated affecteds. The classical variance com- 
ponent model has been a powerful tool for mapping quan- 
titative trait loci in pedigrees [2]. Polygenic effects are 
effectively captured by the kinship coefficient matrix as a 
variance component. In genome-wide association sequen- 
cing (GWAS), two alleles of a single nucleotide poly- 
morphism (SNP) shift trait means and can be tested as a 
fixed effect. However, fitting a variance component model 
with pedigrees is computationally challenging, especially 
when it has to be done for a huge number of markers.We 
reexamine the computational bottlenecks and implement 
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an ultra-fast score test when pedigree structure is explicitly 
given. Score tests require no additional iteration under the 
alternative model. All that is needed is evaluation of a 
quadratic form combining the score vector and the 
expected information matrix at the maximum likelihood 
estimates under the null model. Fast pedigree GWAS is 
now implemented in our software package MENDEL [3] 
for easy use by the genetics community. In this paper, we 
demonstrate the capabilities of MENDEL on the Genetic 
Analysis Workshop 18 (GAW18) sequencing data. 

Methods 

Quantitative trait locus (QTL) association mapping typi- 
cally invokes the multivariate Gaussian distribution to 
model the observed trait values y = (y,) over a pedigree. 
The standard model (2, Chapter 8) collects the correspond- 
ing means into a vector v and the corresponding covar- 
iances into a matrix X and represents the loglikelihood of a 

pedigree as L = —— lndet E (y - v) f £ _1 (y — v) , 
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where the covariance matrix is typically parameterized as 
E = 2cr„ 2 <D + aj A 7 + ff h 2 H + ofl. (1) 

Here the variance component $ is the global kinship 
coefficient matrix capturing additive polygenic effects, and 
A 7 is a condensed identity coefficient matrix capturing 
dominance genetic effects. The household effect matrix H 
has entries hy = 1 if individuals i and j are in the same 
household and 0 otherwise. Individual environmental con- 
tributions and trait measurement errors are incorporated 
via the identity matrix J. When one tests multiple traits, 
the covariance matrix has to be properly augmented by 
matrix Kronecker products. QTL fixed effects are captured 
through the mean component v = A/3 for some predictor 
matrix A and vector of regression coefficients ft. 

To implement likelihood ratio testing, iterative maxi- 
mum likelihood estimation must be undertaken for each 
and every SNP under the alternative hypothesis. This 
unfortunate requirement is the major stumbling block 
retarding pedigree analysis. Score tests serve as conveni- 
ent substitutes for likelihood ratio tests. A careful analy- 
sis shows that the basic elements of the score statistic 
can be quickly assembled. In MENDEL [3], SNPs with 
the most impressive score test /^-values (top 50 by 
default) are further tested by the more accurate likeli- 
hood ratio method, thus achieving a good compromise 
of speed and power for large-scale QTL analysis. 

Results 

Data description 

Our analysis is based on the genotype calls for 959 indi- 
viduals (464 directly sequenced and the rest imputed) 
provided in the chrX-geno.csv.gz files. Simulated traits 
in all 200 replicates (SIMPHEN.1-200) were used for 
size and power studies in the first example. The second 
example presents results from a pedigree GWAS per- 
formed on chromosome 3 using the traits in the first 
simulation replicate (SIMPHEN.l). A whole genome 
QTL analysis for the real phenotype diastolic blood 
pressure (DBP) is presented in the final example. 

Adjustment for environmental effects 

Both the traits (blood pressures) and some environmental 
factors are measured (or simulated) on study individuals 
at 3 or 4 visits. To adjust for the environmental effects of 
Age, BPMed, Smoke, and Sex, we model the systolic 
blood pressure (SBP) by a linear mixed model (LMM): 

SBP iit = in + Age irt B Age + BPMed iit p B PMed 

+ Smokeufismoke + SeXifcex (2) 

+ (Age iit x Sexi) B Agex 

Sex + £i,ti 



where i indexes individuals, t indexes 3 time points, /3 s 
are the fixed effects, IM is an individual level random 
intercept assumed to be normal with covariance 
cov (/x;, fij) = 2<pij, and %t are independent standard normal 
errors. If we stack the traits SBP,, t into a column, this corre- 
sponds to a variance component model with a genetic com- 
ponent 2(Tg(l 3 13 (8) O), where $ is the kinship coefficient 
matrix, and an environmental component cr^hn- LMM is 
fitted by maximum likelihood (ML). 

The estimated fixed effects for traits in simulation repli- 
cate 1 are summarized in Table 1. Estimates under the lin- 
ear model (LM) are listed for comparison. Results from 
LMM imply significant additive genetic effects. The esti- 
mated heritability is 0.65 for SBP, 0.55 for DBP, and 0.63 
for Ql. Residuals from LMM will be used as the multiple 
traits in QTL association mapping. Two types of residuals 

can be used. Residuals r\ J' = SBP;, t — (/2; + x\ t P), where /<,, 
are the best linear unbiased estimate (BLUE) of the ran- 
dom intercept H-i, are decorrelated from the polygenic 
effects. QTL mapping can be performed on rj|' without 

the additive and dominant genetic components in (1). 
However, this strategy ignores the correlation between the 

longitudinal traits. Residuals rf} = SBP iit - (/a + x t it /3), 
where jl is the estimate for the grand intercept, yield the 
adjusted traits still containing the polygenic effects. QTL 
mapping using rf^ needs to keep the genetic components 
to properly capture the correlation between traits. In the 
following, we refer to the former as the decorrelated resi- 
duals (method 1) and to the latter as the correlated 
residuals (method 2). 

Size and power study (using SIMPHEN.1-200) 

Powers for detecting the 6 functional variants in the 
MAP4 gene on chromosome 3 are evaluated based on the 
provided 200 simulation replicates. Figure 1 displays the 
box plots of the 200 — fogioO?-values) for each variant 
using either the decorrelated (methodl) or the correlated 
residuals (method2). Type I errors are evaluated based on 
the provided Ql trait which is not genetically influenced. 
In general, we found that the decorrelated residuals 
(methodl) lead to higher power but also inflated type I 
error. The test using the correlated residuals (method2) 
has well-controlled type I error, high power (0.78 ~ 0.90) 
for detecting the common variants 47957996 and 
48040283 but low power for the rare variants 47913455 
and 47957741. For comparison, we also list the power and 
the size of likelihood ratio test (LRT) using correlated resi- 
duals. LRT edges out the score test in a few cases, but the 
difference is not significant.LRT is practically infeasible for 
a large number of SNPs. In the following two pedigree 
GWAS examples, we present only the results of the score 
test using correlated residuals (method 2). 
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Table 1 Summary of environmental effects for traits systolic blood pressure (top), diastolic blood pressure (middle) 
and Q1 (bottom) in simulation replicate SIMPHEN.1 
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DBP, diastolic blood pressure; SBP, systolic blood pressure. 
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Figure 1 Results of power and size study. Top: Box plots of -Iog10(p-values) from score tests for the 6 functional variants in MAP4 based on 
200 simulation replicates. The red (left) ones use the decorrelated residuals (method 1). The blue (right) ones use the correlated residuals 
(method 2). The horizontal line represents the 0.05 significance level. Bottom: Empirical power and type I error. 



Pedigree Genetic Analysis Workshopon chromosome 3 
(using SIMPHEN.1) 

We performed pedigree GWAS on all available 
sequence variants on chromosome 3 using the corre- 
lated residuals from the traits in SIMPHEN.1. A total 
of 1,213,657 SNPs passed the filtering and were sub- 
ject to testing. Figure 2 displays the run times and the 



Manhattan and quartile-quartile(QQ) plots for jointly 
testing the multivariate traits SPB. No variants passed 
the genome-wide significance level (horizontal line). 
For the null trait Ql, 5.29% of SNPs have ^-values 
less than 0.05, corroborating the correct size of the 
score test. Results for trait SBP are similar and not 
displayed. 
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(SBP^SBPa.SBPs) (DBF^ DBP 2 , DBP 3 ) Ql 

Initialization 1 min, 19 sec 1 min, 20 sec 1 min, 22 sec 

Analysis 5 min, 54 sec 5 min, 56 sec I min, 50 sec 




Figure 2 Results of pedigree genome-wide association sequencing for testing traits systolic blood pressure (SBP), diastolic blood 
pressure (DBP) and Q1 in simulation replicate SIMPHEN.1 on the 1,213,657 single-nucleotide polymorphisms on chromosome 3 and 
849 individuals. Top: Run times on a standard laptop. Bottom: Manhattan plot (left) and QQ plot (right) for the traits (DBPi,DBP 2 , DBP 3 ) 
The horizontal line represents the genome-wide significance level. Plots for SBP and Q1 are similar and are omitted here. 
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Figure 3 Results for pedigree genome-wide association sequencing of 8,348,674 single-nucleotide polymorphisms for the real 
diastolic blood pressure (DBP) traits. Top: Environmental effects fitted from linear model (LM) and linear mixed model (LMM). Numbers in 
parenthesis are p-values. Bottom: Manhattan plot (left) and guartile-quartileplot (right). The horizontal line represents the genome-wide 
significance level. 
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Analysis of real phenotypes diastolic blood pressure 

The phenotypes (SBP and DBP measured at 4 time 
points) are available for 1389 members from 20 extended 
families. The largest family contains 107 individuals; the 
smallest, 27. Genotypes at 8,348,674 SNPs were available 
on 959 of the individuals. For brevity, we only present 
results for the multivariate DBP trait here. 

We adopted the strategy discussed earlierto adjust the 
multivariate traits for the environmental factors. The table 
in Figure 3 summarizes the effects of environmental 
effects estimated by LM and LMM (2). The estimated her- 
itability of the DBP traits is 0.2564. We analyzed all SNPs 
and pedigrees together for the multivariate traits 
(DBP l ,DBP 2 ,DBP 3 ,DBP 4 ). To read in all the data and 
run standard QC procedures took 10 minutes and 14 sec- 
onds. This QC excluded 10,603 SNPs and 124 individuals 
based on genotyping success rates below 98%. The subse- 
quent ped-GWAS analysis ran in 40 minutes and 55 sec- 
onds and included all of the results plotted in Figures 3. 
The complete run never used more than 3.2 GB of RAM. 

The most significant p-value found by whole genome 
analysis was 1 x 10 _1 on chromosome 1 q32.2 region 
at 210,338,112 base pairs. No other SNPs reached gen- 
ome-wide significance. 

Conclusions 

By supplying a comprehensive, fast, and easy-to-use pack- 
age for GWAS on quantitative traits in general pedigrees, 
we hope to encourage exploitation of family-based data sets 
for gene mapping. A gene mapping study should collect as 
large a sample as possible consistent with economic con- 
straints and consistent trait phenotyping.If the sample 
includes pedigrees, all the better. Here we have argued that 
score tests can efficiently handle unrelated individuals, ped- 
igrees, or a mixture. For human studies, in whichcontrolling 
breeding is forbidden, nature has provided pedigrees segre- 
gating every conceivable genetic trait. Many of these pedi- 
grees are known from previous linkage studies and should 
be treasured as valuable resources. 
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